Abstract

Is visitors' attendance a fair indicator of a web site's quality? Internet sub-domains 
are usually characterized by power law distributions of visits, thus suggesting a 
richer-get-richer process. If this is the case, the number of visits is not a relevant 
measure of quality. If, on the other hand, there are active players, i.e. visitors who 
can tell the value of the information available, better sites start getting richer after 
a crossover time. 
Over the last decade, the growing use of the Internet and the rising number of
websites have made huge quantities of information available to a vast audience.
Under such a tremendous information overflow, two of the most important
questions are: (i) what characterizes such systems? and (ii) how does one find
relevant information in the ocean of the WWW? The challenge of finding
relevant information is not new; it existed well before the advent of the
Internet. Take bestseller lists for books: some readers buy books that rank high
on the list, thus reinforcing the standing of highly ranked books; others buy a
book only if it genuinely meets their criteria, regardless of its ranking. The
former group of readers is said to be passive and the latter active. What is true
for books must also hold for movies, consumer products and services,
political candidates, and a myriad of other things in our modern social and
economic life.

It is clear that if all of the players are passive, the outcome is a
richer-get-richer scheme, and the Barabasi-Albert (BA) model provides the
paradigmatic example for a broad class of such phenomena [1]. Our work below is
an attempt to introduce a certain number of so-called active players, who know
better what they want and are not duped by possibly misleading signals. Such a
player finds the targeted object, and her action can hence enhance the ranking
of an otherwise ignored item. We are particularly interested in how the ratio of
passive to active players affects the outcome of information selection. In the
real world most of us are passive most of the time and active only on a few
occasions. This is because, in the normal conduct of daily life, we cannot be
experts in all subjects; we necessarily rely on others' information selection
capability to find what we want. So in this connected world we want to model
the situation in which each specific niche has some active players, while all
others are passive, following a mechanism akin to BA preferential attachment.

Like many other quantities characterizing social networks, web site attendance 
[2] and connectivity [3] seem to be power-law distributed. The BA model [4] 
is believed to explain the fundamental mechanism underlying evolving net- 
works, but does not account for the selection of valuable information. This 
can be achieved by assuming web sites have a scalar intrinsic quality which 
people can, to a certain extent, take into account besides their popularity [5]. 
Nevertheless, we are aware that approximating the quality of a web page by a 
scalar is only adequate when comparing web pages under the same category. 
Otherwise, it will be as inappropriate as providing an absolute ranking among, 
say, physicists and apples. The motivation of the current work is based upon 
two fundamental observations that have not received much attention so far. 

The first one is that the correlation between the popularity of a web site and
its quality emerges from the interplay of heterogeneous visitors. In fact it is
known that old sites enjoy an edge over new and less popular ones. This is due
to the fact that most visitors are passive, i.e. they are easily influenced by
advertising, word of mouth or a web site's rank in search engines. There are,
on the other hand, some people for whom a given piece of information is, for
some reason (it may concern their core business or their main hobby), of great
importance. They will spend a great deal of resources (their money, time and
capability) searching for it, and they will reward good sites regardless of how
famous they are. These active visitors, although a minority, are responsible
for the selection, and eventual popularity, of good web sites. The ecology of
active and passive players has already been dealt with in different contexts
[6]. The second observation is that there is no intrinsic reason why social
networks should display a power law with a given exponent forever, since we have
no control over the changes that the parameters governing them may undergo.
Furthermore, models aiming to describe a network need not be asymptotically
scale free; they might instead have a crossover between different regimes.

Because we would like to address the quality issue and are aware of the
limitation of a scalar representation for quality, we will study a simple
stochastic model that is applicable to attendance statistics of web pages under
the same category, i.e., the quality of each web page will be represented by a
scalar quantity and the active visitor perceives the quality indicator of each
site. To model a larger system with various categories, we have studied a
simplified situation where only passive visitors are considered. This bypasses
the more complicated mathematics needed to model cross-category quality
assessment, while we still hope to capture the early-time statistical properties
of the network. The omission of active visitors is not a severe drawback here,
because each visitor can probably be active in only a few categories, and the
statistical properties before the active opinions become amplified are mostly
influenced by queries from passive visitors.

The paper is therefore organized as follows. In section one we describe a simple 
stochastic toy model of web page attendance. Players can be either active or 
passive, the precision of their activity being determined by an external tem- 
perature. The model displays a power law distribution of web page visits and a 
crossover between two regimes, where the choices of active players are more or 
less influential. In section two we shall analyze particular mean field instances 
of the model, with web sites' qualities respectively delta and uniformly dis- 
tributed, where the stochastic part has been averaged out. In section three 
we sketch the analytical solution of the original model. In section four, while 
suppressing the presence of active visitors, we model the system of multiple 
categories by using a hierarchical geometry. The final section documents our 
conclusion and some remarks. 

In the following we shall use interchangeably the terms web site and web page. 
They are often referred to as nodes of a graph, in the language of networks. 



1 Active and passive players: a stochastic model 



We now describe a simple model of web page attendance in a local
network. The number of web sites $N$ is fixed over the time frame of the visit
dynamics. Each site $i$ is endowed with an intrinsic quality $\epsilon_i$,
distributed according to a given function $\rho(\epsilon)$. Such a scalar
fitness is suitable for comparing sites belonging to the same domain. At each
time step $t$ a new player places a query. In order to account for visitors'
heterogeneity in a minimal model, we consider the case in which players fall
into two different classes: active and passive. With probability $p$ the player
is active, driven by the quality of web sites. Although his decision is affected
by noise, he visits site $i$ with average probability

$$P_i = \frac{e^{\beta \epsilon_i}}{\sum_{j=1}^{N} e^{\beta \epsilon_j}}, \qquad (1)$$

where $\beta$ is an external parameter which plays the role of an inverse
temperature. Active players may be thought of as experts of the domain under
consideration. With complementary probability $(1 - p)$, on the other hand,
the player is passive, driven by the popularity of web sites. His probability
$Q_i$ of visiting site $i$ follows linear preferential attachment:

$$Q_i = \frac{n_i(t)}{\sum_{j=1}^{N} n_j(t)}, \qquad (2)$$

where $n_i(t)$ is the number of visits to site $i$ at time $t$. Notice that
$\sum_{j=1}^{N} n_j(t) = t$, since we have one visit per time step.

The following stochastic equation is intended to mimic the model just de- 
scribed: 

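$$n_i(t+1) - n_i(t) = (1 - p)\,\frac{n_i(t)}{t} + p\,\xi_i(t) \qquad (3)$$

with initial conditions

$$n_i(0) = 0, \qquad n_i(1) = \xi_i(0). \qquad (4)$$

The stochastic noise $\xi_i$ is distributed as follows:

$$P(\xi_i) = P_i\,\delta(\xi_i - 1) + (1 - P_i)\,\delta(\xi_i). \qquad (5)$$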

The first term on the rhs of (3) accounts for the "richer get richer" phenomenon 
due to passive players using preferential attachment (2). The second term is 
stochastic and describes the behavior of active players. On average they employ 
probability (1), but they are only allowed to pick one site when they come into 
play. Therefore the noise term must be normalized as follows: 

$$\sum_{i=1}^{N} \xi_i(t) = 1 \quad \forall\, t. \qquad (6)$$
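The dynamics described in this section can be sketched in a few lines of Python. This is only an illustrative direct-sampling variant, not the authors' code: it assumes that an active player picks site i with probability proportional to exp(beta * eps_i), that a passive player picks site i with probability n_i / t, and that the very first query is always active (no site has any popularity yet). The function names, the seed handling and the example parameters are assumptions made for this sketch.

```python
import math
import random


def active_probabilities(qualities, beta):
    """Softmax weights exp(beta * eps_i), normalised to sum to one."""
    shift = max(qualities)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (q - shift)) for q in qualities]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_visits(qualities, p, beta, steps, seed=0):
    """Return the list of visit counts n_i after `steps` queries."""
    rng = random.Random(seed)
    sites = range(len(qualities))
    probs = active_probabilities(qualities, beta)
    visits = [0] * len(qualities)
    for t in range(steps):
        # Assumption: the first query is active, since all n_i are zero.
        if t == 0 or rng.random() < p:
            chosen = rng.choices(sites, weights=probs)[0]
        else:
            # Passive player: linear preferential attachment, Q_i = n_i / t.
            chosen = rng.choices(sites, weights=visits)[0]
        visits[chosen] += 1
    return visits


if __name__ == "__main__":
    rng = random.Random(1)
    eps = [rng.random() for _ in range(100)]
    counts = simulate_visits(eps, p=0.5, beta=10.0, steps=500)
    print(sorted(counts, reverse=True)[:5])
```

Each step adds exactly one visit, so the counts always sum to the number of queries, in line with the normalisation of the noise term.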



2 Mean field results 



If we average equation (3) over the noise, making use of (5), and take the
continuous time limit, we obtain the corresponding mean field evolution
equation. We shall analyze it for two significant instances of the distribution
of web sites' quality $\rho(\epsilon)$.

2.1 Random choice



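Let us consider a network where all web sites share the same quality value
$\epsilon_i = \epsilon$. Since, in this case, the preference probability of
active players (1) becomes $P_i = 1/N$, they actually place random queries. The
evolution equation reads in this case:

$$\frac{dn_i}{dt} = (1 - p)\,\frac{n_i}{t} + \frac{p}{N} \qquad (7)$$

with initial conditions $n_i(t, t_i) = 0 \;\forall\, t < t_i$ and
$n_i(t_i, t_i) = 1$. Here $t_i$ is the time of the first visit to site $i$.
For $t > t_i$ the solution is:

$$n_i(t, t_i) = \frac{t}{N} + \left(1 - \frac{t_i}{N}\right)\left(\frac{t}{t_i}\right)^{1-p}. \qquad (8)$$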
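A quick numerical check of the mean-field solution may be useful here. The sketch below assumes that the evolution equation has the form dn/dt = (1 - p) n / t + p / N with n(t_i) = 1, and that its solution is n(t) = t/N + (1 - t_i/N) (t/t_i)^(1-p); the function names and the finite-difference step are choices made for this illustration.

```python
def mean_field_visits(t, t_first, n_sites, p):
    """Closed form n(t) = t/N + (1 - t_first/N) * (t/t_first)**(1 - p)."""
    linear_part = t / n_sites
    transient = (1.0 - t_first / n_sites) * (t / t_first) ** (1.0 - p)
    return linear_part + transient


def ode_residual(t, t_first, n_sites, p, h=1e-4):
    """Centred finite-difference residual of dn/dt - [(1 - p) n / t + p / N]."""
    derivative = (mean_field_visits(t + h, t_first, n_sites, p)
                  - mean_field_visits(t - h, t_first, n_sites, p)) / (2.0 * h)
    rhs = (1.0 - p) * mean_field_visits(t, t_first, n_sites, p) / t + p / n_sites
    return derivative - rhs


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(p, mean_field_visits(500.0, 20.0, 100, p),
              ode_residual(500.0, 20.0, 100, p))
```

The residual should vanish up to discretisation error, and the closed form should return one visit at the first-visit time.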



The statistics of the first-visit time is well described by the probability
that a site $i$ is not searched by active players up until time $t_i$, i.e.

$$P(t_i) = \frac{p}{N}\left(1 - \frac{p}{N}\right)^{t_i - 1} \simeq \frac{p}{N}\, e^{-p\, t_i / N},$$

where the last formula holds in the large $N$ limit. The average number of
visits to a site is given by

$$\langle n(t, t_i) \rangle_{t_i} = \int n(t, x)\, P(x)\, dx = \frac{t}{N} + \left(\frac{p\, t}{N}\right)^{1-p} \left[ \Gamma(p) - \frac{\Gamma(p+1)}{p} \right] = \frac{t}{N},$$

as expected. Upon separating the terms due to the action of active and passive
players in (8) and equating them, one finds the crossover time



t ~ 



(9) 



We have performed numerical simulations of this model, gathering data at
time $T = 500$. In fig. 1 we plot the probability distribution of visits for
three different values $p = 0.1, 0.9, 0.5$, whose corresponding crossover times
(9) are $t_c = 26N, 2.5N, 4.5N$. After $T = 500$ time steps, therefore, the
system of $N = 100$ sites is expected to be found, respectively, in the
passive-dominated regime, in the active-dominated regime and around the
transition region for the three $p$ values chosen. In fact we observe that for
$p = 0.1$, when passive players are the majority, the density $P(n)$ of sites
with a given number $n$ of visits decreases as a power law of $n$, whereas for
$p = 0.9$, when active players are the majority, it follows a gaussian
distribution. Finally, for $p = 0.5$, we observe a more complex intermediate
behavior.

Fig. 1. Simulation results of the model described in (7) for $p = 0.1, 0.5, 0.9$
with $N = 100$ web pages at time $T = 500$. Data are averaged over 500 runs. The
solid lines are fits to the data: power law for $p = 0.1$ and gaussian for
$p = 0.9$.



2.2 Uniform quality distribution 



We shall now analyze the continuous time limit of equation (3), averaged with
respect to the noise distribution (5), with the qualities $\epsilon_i$ uniformly
distributed between zero and unity. The corresponding evolution equation reads:

$$\frac{dn_i}{dt} = (1 - p)\,\frac{n_i}{t} + p\,P_i \qquad (10)$$

with initial conditions $n_i(t, t_i) = 0 \;\forall\, t < t_i$ and
$n_i(t_i, t_i) = 1$. For $t > t_i$ the solution reads

$$n_i(t, t_i) = P_i\, t + (1 - t_i P_i)\left(\frac{t}{t_i}\right)^{1-p}.$$



As in the random case, this model displays two different regimes. The crossover 
time can be similarly computed: 

tc^N ^ ^! (11) 



Fig. 2. Probability of having $n$ visits at times $T = 500$ (a) and T = 10^ (b),
following equation (10). Simulations are performed for a system of $N = 100$ web
pages, with $\beta = 10$ and $p = 0.1, 0.5, 1.0$. In both graphs the power-law
behavior expected for $p = 1.0$, $P(n) \sim 1/n$ (12), is represented by a solid
line.

For $t \gg t_c$ the dynamics is dominated by active players, who select valuable
web sites. In the $p = 1.0$ limit, in particular, we only have active players in
the system. The average number of visits to a given web page $i$ reads

$$\langle n_i(t) \rangle = \frac{t\, e^{\beta \epsilon_i}}{A},$$

where $A = \sum_{i=1}^{N} e^{\beta \epsilon_i}$. The probability distribution of
the number of visits satisfies the relation $P(n)\,dn = \rho(\epsilon)\,d\epsilon$
which, together with the expression above, gives the power law distribution

$$P(n) \sim \frac{1}{n}. \qquad (12)$$
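The change of variables behind the 1/n law can be made explicit with a small sketch. It assumes the p = 1 limit with qualities uniform on [0, 1] and visit counts n = t exp(beta * eps) / A, so that P(n) dn = d(eps) gives P(n) = 1 / (beta n) on a finite interval. The names `visits_density`, `total_probability` and the argument `norm` (standing for A) are assumptions of this illustration.

```python
import math


def visits_density(n, beta, t, norm):
    """Density P(n) = 1 / (beta * n) on [t / norm, t * exp(beta) / norm]."""
    n_min = t / norm
    n_max = t * math.exp(beta) / norm
    if n < n_min or n > n_max:
        return 0.0
    return 1.0 / (beta * n)


def total_probability(beta, t, norm, bins=20000):
    """Midpoint-rule integral of P(n) on a logarithmic grid of n values."""
    log_min = math.log(t / norm)
    step = beta / bins  # log(n_max) - log(n_min) equals beta
    total = 0.0
    for k in range(bins):
        n_mid = math.exp(log_min + (k + 0.5) * step)
        width = math.exp(log_min + (k + 1) * step) - math.exp(log_min + k * step)
        total += visits_density(n_mid, beta, t, norm) * width
    return total
```

The integral of this density over its support should be close to one, which confirms that 1/(beta n) is the properly normalised form of the power law.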



We have simulated this dynamics for different values of $\beta$ and over the
entire range of $p$. In Fig. 2 we report the distribution $P(n)$ at small times
$T = 500$ (graph (a)) and at longer times T = 10^ (graph (b)). The solid lines
plot the theoretical curve (12), expected to be valid in the $p = 1.0$ limit.

In order to measure how much the differences in the quality of web pages are
reflected in the number of visits they receive, we have ranked them in order of
decreasing number of visits and of decreasing quality. We have then measured
the probability $P^{r,r}$ that the web page ranked $r$ by number of visits
coincides with the web page ranked $r$ in the quality ordering. In order to
measure $P^{r,r}$ we have run the visit dynamics for 500 time steps and
counted the number of times the web pages ranked $r$ in the two orderings were
the same. In fig. 3 we plot simulation data for different values of $p$ at
number of visits $T = 500$ and T = 10^. In fig. 4 we plot the probability that
the web page that receives the maximal number of visits is the best one, i.e.
$P^{1,1}$, as a function of $p$.
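The rank-coincidence measurement can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: both orderings run from largest to smallest, ties are broken by site index, and the frequencies are averaged over independent runs supplied by the caller. All function names are hypothetical.

```python
def ranking(values):
    """Indices of `values` sorted from largest to smallest (rank 1 first)."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


def rank_matches(visits, qualities):
    """For each rank r, True if the same site holds rank r in both orderings."""
    by_visits = ranking(visits)
    by_quality = ranking(qualities)
    return [a == b for a, b in zip(by_visits, by_quality)]


def rank_match_frequency(runs):
    """Average rank_matches over an iterable of (visits, qualities) pairs."""
    totals = None
    count = 0
    for visits, qualities in runs:
        matches = rank_matches(visits, qualities)
        if totals is None:
            totals = [0] * len(matches)
        totals = [t + int(m) for t, m in zip(totals, matches)]
        count += 1
    return [t / count for t in totals]
```

The first entry of the returned list estimates the probability that the most visited page is also the best one.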
Fig. 3. Probability $P^{r,r}$ versus $r$ in model (10). Data are shown for a
system of $N = 100$ web pages, $\beta = 10$ and number of visits $T = 500$ and
T = 10^ respectively for graph (a) and graph (b). Averages are taken over 500
runs.

Fig. 4. Probability $P^{1,1}$ that the most visited web page is the fittest one
in model (10). Simulations have been performed on a system of $N = 100$ web
pages at $T = 500$ (graph (a)) and T = 10^ (graph (b)) visits. Probabilities are
calculated over 500 runs.

3 Analytical results for the stochastic model 



Now we turn to our original model, described by equation (3) with initial
conditions (4), and try to solve it making use of methods similar to those
outlined in reference [8]. To this end it is useful to define the generating
functions:

$$G_i(\lambda) = \sum_{t=1}^{\infty} \lambda^t g_i(t), \qquad (13)$$

$$\Xi_i(\lambda) = \sum_{t=1}^{\infty} \lambda^t \xi_i(t), \qquad (14)$$
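where $g_i(t) = n_i(t)/t$. In fact, if we multiply (3) by $\lambda^t$ and sum
over $t$, we obtain

$$(1 - \lambda)\,\partial_\lambda G_i(\lambda) = (1 - p)\,G_i(\lambda) + p\,\Xi_i(\lambda) + \xi_i(0),$$

which admits the following formal solution

$$G_i(\lambda) = (1 - \lambda)^{-(1-p)} \int_0^{\lambda} d\lambda'\, (1 - \lambda')^{-p} \left[ p\,\Xi_i(\lambda') + \xi_i(0) \right]. \qquad (15)$$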
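As a sanity check on the change of variables, the sketch below assumes g(t) = n(t)/t and verifies numerically that the update n(t+1) = n(t) + (1 - p) n(t)/t + p xi(t) is equivalent to the recursion (t + 1) g(t+1) = (t + 1 - p) g(t) + p xi(t), which is what the differential equation for the generating function encodes term by term. The function names and the example noise sequence are assumptions of this illustration.

```python
import random


def iterate_visits(noise, p):
    """Iterate n(t+1) = n(t) + (1 - p) n(t) / t + p xi(t), with n(1) = xi(0)."""
    n = [0.0, float(noise[0])]  # n(0) = 0, n(1) = xi(0)
    for t in range(1, len(noise)):
        n.append(n[t] + (1.0 - p) * n[t] / t + p * noise[t])
    return n


def iterate_rescaled(noise, p):
    """Iterate (t + 1) g(t+1) = (t + 1 - p) g(t) + p xi(t), with g(1) = xi(0)."""
    g = [None, float(noise[0])]  # g(0) is undefined; g(1) = n(1) / 1
    for t in range(1, len(noise)):
        g.append(((t + 1.0 - p) * g[t] + p * noise[t]) / (t + 1.0))
    return g


if __name__ == "__main__":
    rng = random.Random(3)
    xi = [1 if rng.random() < 0.2 else 0 for _ in range(50)]
    n = iterate_visits(xi, p=0.4)
    g = iterate_rescaled(xi, p=0.4)
    print(max(abs(n[t] / t - g[t]) for t in range(1, len(n))))
```

The printed maximum deviation should be at the level of floating-point rounding.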



Comparison between the small-$\lambda$ expansion of (15) and definition (13)
yields

$$g_i(t) = \sum_{s=0}^{t-1} \xi_i(s)\, A_{p,t}(s)$$



with coefficients 



-i)*+^r(p + t-i) 
t\ r(p) 
r(s + i)r(p + t- 1) 



r(t+ i)r(p + s) 



(16) 
(17) 



3.1 Probability distribution 



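We are now able to write a formal expression for $P_t(\{g_i\})$, the probability
density function of having a certain set of $g_i$'s ($i = 1, 2, \ldots, N$) at
time $t$:

$$P_t(\{g_i\}) = \int \prod_{i=1}^{N} \delta\!\left[ g_i(t) - \sum_{s=0}^{t-1} \xi_i(s)\, A_{p,t}(s) \right] \prod_{t'=0}^{t-1} \left[ \prod_{j=1}^{N} d\xi_j(t')\, P(\xi_j(t')) \right] \delta_{Kr}\!\left[ \sum_{l=1}^{N} \xi_l(t') - 1 \right]. \qquad (18)$$

We can now employ the Fourier representation of the Dirac delta function

$$\delta\!\left[ g_i - \sum_{s=0}^{t-1} \xi_i(s)\, A_{p,t}(s) \right] = \int \frac{dk_i}{2\pi} \exp\!\left\{ -a k_i^2 + i k_i \left[ g_i - \sum_{s=0}^{t-1} \xi_i(s)\, A_{p,t}(s) \right] \right\}$$

inside (18). Thus we can separate the noise and integrate it out: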



N 



27r 



k=l 



1=1 

t-1 N 



N 

Yl / dkiexp{-akf + ikgi] 



s=Oh=l ^ Ph 
By rewriting the last term of the integrand as a product of sums (instead of a
sum of products) and taking the limit $a \to 0$, the former expression becomes:



Here $\{k\} = k_1, k_2, \ldots, k_N$ represents a particular outcome of the game,
in which site 1 has been visited $k_1$ times, site 2 $k_2$ times, and so forth.
The symbol $T(\{k\})$ stands for the set of time sequences $\tau_{li}$ (the time
step at which site $l$ has been visited for the $i$-th time) with a given set of
$k$. There are

$$M(\{k\}) = \frac{t!}{\prod_{i=1}^{N} k_i!}$$

such sequences.
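The number of visit sequences compatible with a given outcome can be checked by brute force for small cases. The sketch assumes that the count is the multinomial coefficient t! / (k_1! ... k_N!) with t equal to the sum of the k_i; the function names are hypothetical and the brute-force enumeration is only practical for a handful of sites and time steps.

```python
import itertools
import math


def sequence_count(counts):
    """Multinomial coefficient t! / (k_1! k_2! ... k_N!) with t = sum(counts)."""
    total = math.factorial(sum(counts))
    for k in counts:
        total //= math.factorial(k)
    return total


def brute_force_count(counts):
    """Count visit sequences of length t in which site i appears counts[i] times."""
    t = sum(counts)
    sites = range(len(counts))
    matching = 0
    for sequence in itertools.product(sites, repeat=t):
        if all(sequence.count(i) == counts[i] for i in sites):
            matching += 1
    return matching
```

For example, with three sites visited 2, 1 and 1 times over four steps, both functions should agree on the number of orderings.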

Having in hand the complete probability distribution (19) with coefficients 
(16) and (17), all quantities of interest can be calculated with projection tech- 
niques. 

3.2 Probability that the best wins 

As an example of quantities that can be calculated, we shall find the
probability $P_t^{1,1}$ that the site with the best fitness has the greatest
number of visits at time $t$. Defining the events
$W_k = \{ g_k(t) > g_i(t) \;\forall\, i \neq k \}$ and
$E_k = \{ \epsilon_k > \epsilon_i \;\forall\, i \neq k \}$, we can write




(19) 



$$P_t^{1,1} = \sum_{k=1}^{N} P_t(W_k \mid E_k)\, P_t(E_k) = N\, P_t(W_1, E_1).$$

The joint probability above reads 




(20) 



If we now plug equation (19) into equation (20), we obtain: 



1,N 



Pt{Wi,Ei) 



E 



M{k})ftm) 



(21) 
I N 91 jv



» iV y JV / Kl 

M{k})= E dg^U dg,l[d[gi-Y.A,,{Ti, 

T({fc})0 *=2o '=1 V ^=0 



i=2 



(22) 



■ (23) 



While the integral in (22) is straightforward, i.e. 



N 



M{k})= En© 

T({fc})/=2 



i=l 



i=l 



(24) 



the one in equation (23) needs some approximation to be solved. 



3.2.1 Calculation of ft{{k}) 

Let us assume the $\epsilon_i$ are uniformly distributed between zero and one.
Defining the transformation of variables

$$y_i = \frac{e^{\beta \epsilon_i}}{A}, \qquad A = \sum_{i=1}^{N} e^{\beta \epsilon_i},$$



equation (23) can be written: 

JVe" 



Mm 



JVC e."+iV-i N N / ^ \ 

^ / dAA^-*-^ / dy, I Udy,U[y^'-'i^-y^r'']^[^-l:y^)■i^^) 



Using the integral representation of the delta function

$$\delta\!\left( 1 - \sum_{i=1}^{N} y_i \right) = \int dq\; e^{-q \left( 1 - \sum_{i=1}^{N} y_i \right)},$$

one can evaluate the integral (25) by means of the saddle point method, where
the variable $q$ will play the role of a Lagrange multiplier. Maximization of
the integrand yields



kn — 1 qiki — l){t — ki) „ 
yi-^-r + \. ^. for g < t - 1 



1 



(t - 1)^ 
^ t'-tN+t-Ei,kr

the former result is, therefore, valid only if $t \gg N$. In this limit the
approximate solution reads:

1 ^ 

M{k}) = j^^ih - t/N) n t-'ii - yiY-'^Qih - h). (26) 



In the opposite limit, that of large $N$, one can use the law of large numbers
over the fitness distribution, i.e.

$$\sum_{i=1}^{N} e^{\beta \epsilon_i} \to A = \frac{N}{\beta}\left( e^{\beta} - 1 \right) \quad \text{for } \beta > 0$$
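The large-N replacement of the sum by its average can be illustrated numerically. The sketch assumes qualities drawn uniformly on [0, 1], for which the mean of exp(beta * eps) is (e^beta - 1) / beta; the function names, seed and sample sizes are choices made for this example.

```python
import math
import random


def partition_sum(qualities, beta):
    """A = sum_i exp(beta * eps_i) for a given sample of qualities."""
    return sum(math.exp(beta * q) for q in qualities)


def large_n_estimate(n_sites, beta):
    """Large-N value N (e^beta - 1) / beta for qualities uniform on [0, 1]."""
    return n_sites * (math.exp(beta) - 1.0) / beta


if __name__ == "__main__":
    rng = random.Random(7)
    for n_sites in (100, 10_000, 100_000):
        sample = [rng.random() for _ in range(n_sites)]
        ratio = partition_sum(sample, beta=2.0) / large_n_estimate(n_sites, beta=2.0)
        print(n_sites, ratio)
```

The printed ratio should approach one as the number of sites grows, with fluctuations shrinking roughly as the inverse square root of N.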



and 



N 

n 

1=2 



t-ki 



ct({^})=exp 



N 



= exp 



J2Ph <ei> -Uog A + {t - A;,) (log (A - e^^^)) 



EPki A I ^i) ^ e^' 1 



(27) 
(28) 



Here the angular brackets stand for averages over the distribution of the
$\epsilon_i$'s. Hence:



1 - e 

t-k-i 



I3e\ i-'^l 



(29) 



■ctiik}) 



e (AA;i - t) , (30) 



J J \ ^ J V 

where the saddle point method has been employed to solve the last integral. 



4 Statistics of visits on an ultrametric argument space



We now turn our attention to the entire World Wide Web. While inside a given
category it is easy to compare different sites, the WWW deals with various
arguments, sometimes not overlapping at all. Scalar fitnesses should, therefore,
be replaced by vectors, and players could only be active in the domains in which
they are experts.

Here we would like to introduce a simple hierarchical structure of Internet
categories, taking into account only passive players employing different search
efforts. Inspired by a recent work on social networks [7] we place $N = 2^M$
sites on the leaves of an ultrametric tree with $M$ levels. Upon labeling them
sequentially from $0$ to $N - 1$, the ultrametric distance between two web pages
$i$ and $j$ can be defined as the smallest exponent $d$ such that

$$[i/2^{d'}] = [j/2^{d'}] \quad \text{for } d' \geq d,$$

where $[a]$ denotes the integer part of $a$. At each time step a visitor places
a query (say $i \in [0, N - 1]$) and extracts a distance $d$ from a distribution
$p(d)$, such that all the nodes within a radius $d$ from the query are eligible
answers. We assume that the probability that a site $j$ receives a visit as a
consequence of a given query $i$ is driven by the generalized preferential
attachment rule [9]

$$\Pi_j \propto \theta\big( d - d(i,j) \big)\, (n_j + 1)^{\alpha}.$$

Here the constant one represents the initial attractiveness of each node, which 
is necessarily non-zero when there are no active players. Since the system 
is not growing and the number of sites is fixed, as in the quasi-static scale 
free networks [10], simple linear preferential attachment is not enough for the 
number of visits to be power-law distributed. 
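The hierarchical model can be sketched in Python as follows. Several choices here are assumptions made for illustration: the distance between two leaves is taken as the number of tree levels to climb before they share an ancestor, a site is eligible when its distance from the query does not exceed the drawn radius, queries are uniform over the sites, and the radius is uniform over 0..M. The function names are hypothetical.

```python
import random


def ultrametric_distance(i, j):
    """Number of tree levels to climb before leaves i and j share an ancestor."""
    d = 0
    while (i >> d) != (j >> d):
        d += 1
    return d


def visit_weights(query, radius, visits, alpha):
    """Unnormalised weights theta(radius - d(query, j)) * (n_j + 1) ** alpha."""
    return [
        (n_j + 1) ** alpha if ultrametric_distance(query, j) <= radius else 0.0
        for j, n_j in enumerate(visits)
    ]


def simulate_queries(levels, steps, alpha, seed=0):
    """Passive-only dynamics with the radius drawn uniformly from 0..levels."""
    rng = random.Random(seed)
    n_sites = 2 ** levels
    visits = [0] * n_sites
    for _ in range(steps):
        query = rng.randrange(n_sites)
        radius = rng.randint(0, levels)
        weights = visit_weights(query, radius, visits, alpha)
        chosen = rng.choices(range(n_sites), weights=weights)[0]
        visits[chosen] += 1
    return visits
```

The queried site itself is always eligible, so the weights never sum to zero; the initial attractiveness of one lets unvisited sites receive their first visit.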

Let us first analyze the model when $p(d)$ is uniform. In figure 5 the number
of visits n of a given web page i is plotted versus its rank r{i) = A^ZIJ^ P{i^i)- 
For $\alpha \geq 2$ we observe a power-law behavior of $n(r)$

n{r) ~ (31) 

corresponding to a power-law P{n) ^ n « in the probability density of the 
number of visits. For lower values of the exponent a (including the special 
case of linear preferential attachment) the power-law (31) disappears. 



Fig. 5. Number of visits $n(i)$ versus rank in the ultrametric model with
uniform radius distribution, for different values of $\alpha$. The simulations
are performed on a system of 128 webpages and the data are averaged over 100
runs. For $\alpha = 2$ the best fit to the data (indicated with a solid line) is
a power law with exponent $\simeq 1.2$.

Fig. 6. Effect of the variation of $p(d)$ on the rank distribution of visits in
a system of $N = 128$ nodes. In graph (a) the model studied has linear
preferential attachment ($\alpha = 1$), in graph (b) the model studied has
$\alpha = 2$. Simulations are carried out for T = 10^ time steps. The two bottom
lines are instances of distribution (32), the two top ones of (33). The instance
with uniform distribution has also been drawn for comparison.

Dependence on the specificity of the request

We now analyze the role of the distribution $p(d)$, which modulates the
proportion of players who place a very specific query with short distance $d$
and that of players who are much more easily satisfied, looking for popular
sites within a wider radius. In particular we have considered the two
distributions

$$p(d) \sim (\kappa + 1)\,(d/M)^{\kappa}, \qquad (32)$$

$$p(d) \sim (\theta + 1)\,(1 - d/M)^{\theta}. \qquad (33)$$


In the first case queries that require a more precise answer (small distances d) 
are less frequent than queries that require a less precise answer. The reversal 
applies to the second case. 
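The two families of radius distributions can be tabulated with a short sketch. It assumes that d takes the integer values 0..M and that the weights are proportional to (d/M)^kappa in the first case and to (1 - d/M)^theta in the second; the normalisation is done numerically rather than through an analytic prefactor, and the function names are hypothetical.

```python
def radius_distribution_far(levels, kappa):
    """Normalised weights proportional to (d / M) ** kappa for d = 0..M."""
    weights = [(d / levels) ** kappa for d in range(levels + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def radius_distribution_near(levels, theta):
    """Normalised weights proportional to (1 - d / M) ** theta for d = 0..M."""
    weights = [(1.0 - d / levels) ** theta for d in range(levels + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mean_radius(pmf):
    """Average search radius under a discrete distribution over d = 0..M."""
    return sum(d * prob for d, prob in enumerate(pmf))
```

With equal exponents the two distributions are mirror images of each other, so the first favours wide searches and the second favours specific ones.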

We start with a system of web pages in which the visit dynamics is driven by
preferential attachment with $\alpha = 1$. The impact of varying the
distribution $p(d)$ is illustrated in Fig. 6a, where the rank distribution of
visits is displayed in linear scale. Indeed the curve $n(r)$ becomes steeper as
visitors increase their search radius.



The role of the distribution $p(d)$ becomes crucial for greater values of
$\alpha$. In particular, in Fig. 6b we consider the case $\alpha = 2$. For a
distribution of the type (32) the power-law functional form of $n(r)$ (Eq. 31)
breaks down, giving way to a logarithmic dependence on the rank,
$n(r) \sim \log(r)$. For distribution (33) the scaling (31) is conserved, but
the curves get steeper with increasing values of $\theta$.



5 Conclusions 



We proposed a simple model of web page attendance, focusing our attention
on web pages that belong to the same category. We investigated correlations
between the number of visits they receive and their quality, as they emerge from
the interplay of heterogeneous players: passive players, driven by the
popularity of the web page and its advertisement, and active players, driven by
their own knowledge of the sites' intrinsic quality. We studied the model
by numerical simulations and by analytical calculations. Connectivity
statistics follow power laws with different slopes, but a typical length scale
might occasionally appear, as in fig. 1. When a scalar indicator characterizes
the intrinsic quality of web sites, the participation of experts can improve the
effectiveness of Internet searches (fig. 3 and 4). In fact the model displays a
crossover from the passive-dominated phase to the active-dominated phase, in
which the correlations between quality and visits build up. It would be
interesting to know where the actual World Wide Web is placed with respect to
this crossover point and how much the popularity-based ranking, still used by
some search engines, really reflects the ideal quality ranking of the web pages.

We concluded the paper with a discussion of how web site attendance varies
in a system with multiple categories. We formulated a hierarchical model in
which we define a distance (the ultrametric distance) between different pages.
In this model there are no active players who visit their chosen web page;
there are only passive players, who visit web pages at a distance less than $d$
from a given query. We have shown that the more generic the search (the wider
the radius $d$), the steeper the distribution of the number of visits, while the
more specific the search (the smaller $d$), the smoother the distribution of
visits.